Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida

Authors

  • Isabell Hubert
  • Antti Arppe
  • Jordan Lachler
  • Eddie Antonio Santos
Abstract

In this paper, we present our work on creating the first optical character recognition (OCR) model for Northern Haida, also known as Masset or Xaad Kil, a nearly extinct First Nations language spoken in the Haida Gwaii archipelago in British Columbia, Canada. We address the challenges of training an OCR model for a language with an extensive, non-standard Latin character set as follows: (1) We compare various training approaches and present the results of practical analyses aimed at maximizing recognition accuracy and minimizing manual labor. An approach that uses just one or two pages of Source Images directly performed better than the Image Generation approach, and better than models based on three or more pages. Analyses also suggest that a character’s frequency is directly correlated with its recognition accuracy. (2) We present an overview of the OCR accuracy analysis tools currently available. (3) We have ported the once de facto standard OCR accuracy tools so that they can cope with Unicode input. We hope that our work will encourage further OCR endeavors for other endangered and/or under-researched languages. Our work adds to a growing body of research on OCR for particularly challenging character sets, and contributes to creating the largest electronic corpus for this severely endangered
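As an illustration of what Unicode-aware accuracy measurement involves, the following is a minimal sketch and not the ported tools described in the abstract. It assumes a common definition of character accuracy (one minus the edit distance divided by the reference length) and assumes that both strings should be normalized to NFC before comparison; the function names `edit_distance` and `character_accuracy` are hypothetical.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance computed over Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed definition: 1 - edits / len(reference), floored at 0."""
    # Normalize so precomposed and decomposed forms of a character compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))
```

Without the normalization step, a letter stored as a base character plus a combining mark would be counted as an error against its precomposed equivalent, which matters for character sets that rely heavily on diacritics.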

Publication date: 2016